Abstract

We analyse the distributions of the number of goals scored by home teams, away 
teams, and the total scored in the match, in domestic football games from 169 
countries between 1999 and 2001. The probability density functions (PDFs) of goals 
scored are too heavy-tailed to be fitted over their entire ranges by Poisson or negative 
binomial distributions which would be expected for uncorrelated processes. Log- 
normal distributions cannot include zero scores and here we find that the PDFs are 
consistent with those arising from extremal statistics. In addition, we show that it 
is sufficient to model English top division and FA Cup matches in the seasons of 
1970/71 to 2000/01 on Poisson or negative binomial distributions, as reported in 
analyses of earlier seasons, and that these are not consistent with extremal statistics. 

1 Introduction



Few authors have considered foot- 
ball scores from a statistical point of 
view. Moloney [1] showed that the 
numbers of goals scored by individual 
teams, and the total goal scores, were 
well described by a "modification of 
the Poisson"; Reep et al. [2] later
identified this as the negative bino-
mial distribution, and found similar
results for other ball games. Maher 
[3] then pointed out that a negative 
binomial distribution may arise from
the aggregate of Poisson-distributed 
scores with a different mean for each 
team. The short-term predictabil- 
ity of results has subsequently been 
modelled using independent Poisson 
distributions with means dependent 
on teams' past performances [4,5]; 
an improved model [6] includes the 
scoring-rate dependence on both 
time and the current score. 



Other aspects of the game have been 
examined including the effects of cer- 
tain conditions on the scores - see [7] 
for a review. Seeking a broader un- 
derstanding, it has been suggested 
[8] that the distribution of goals per 
player may be linked with anomalous
diffusion via the Zipf-Mandelbrot
law. In this paper we show, in agree- 
ment with analyses of matches from 
the 1960s [1,2], that 13,000 En-
glish top division and 5,000 FA
Cup matches between the seasons of 
1970/71 and 2000/01 [9] are closely- 
fitted by either Poisson or negative 
binomial distributions. However, we 
find that the number of goals scored 
by home and away teams, and the 
total goals, in over 135,000 domes-
tic football games (leagues and cups,
hereafter referred to as domestic 
matches) from 169 countries between 
1999 and 2001 [10], cannot be fitted 
over their entire ranges by Poisson 
or negative binomial distributions. 
Instead, we find that the data can 
be modelled by extremal statistics 
(explained in Sect. 3). 

The ubiquity of power-law relation- 
ships in both nature [16] and the field 
of econophysics [17-20] has spawned 
a significant amount of literature in 
recent years. Intriguingly, extremal 
statistics in a global measure are 
found in turbulent fluids and other
highly-correlated systems [21-26]. 
Hence the significance and origin of 
extremal and power-law-tailed dis- 
tributions are currently of consider- 
able interest in statistical physics; 
the use of probability distributions 
in the modelling of complex systems 
is a topical approach to the inverse 
problem. From an operational per- 
spective, knowledge of the statistics 
would be an important constraint on 
any model for the game. 



2 The probability density 
function (PDF) 

The first step in our analysis of each 
data set is the construction of its 
PDF. The PDF $P(x)$ of a variable $X$
is defined such that the probability
that $X$ lies within a small interval
$dx$ centred on $X = x$ is equal to
$P(x)\,dx$. $P(x)$ is normalised so that

$\int_{-\infty}^{\infty} P(x)\,dx = 1.$ (1)

Here, $x$ takes the integer values of
goals scored ($x_{\min} = 0$ and $x_{\max}$ is
the maximum number of goals scored
in the sample of matches) so Eq. 1
becomes $\sum_{x = x_{\min}}^{x_{\max}} P(x) = 1$. We further
normalise each PDF to the sample
mean $\mu$ and standard deviation $\sigma$ to
enable comparison with extremal dis-
tributions (see Sect. 3).
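
As a minimal sketch of this construction (not the authors' code), the following Python builds the empirical PDF from a list of integer scores and rescales it. The sample scores and the function names are hypothetical. The rescaling convention (shift by the mean, divide by the standard deviation, multiply probabilities by the standard deviation) is an assumption about what normalising to the sample mean and standard deviation involves.

```python
from collections import Counter
from math import sqrt


def empirical_pdf(goals):
    """Return (pdf, mean, std) for a list of non-negative integer scores.

    pdf maps every score x in 0..max(goals) to its relative frequency,
    so the values sum to 1 (the discrete form of Eq. 1).
    """
    n = len(goals)
    counts = Counter(goals)
    pdf = {x: counts.get(x, 0) / n for x in range(max(goals) + 1)}
    mean = sum(x * p for x, p in pdf.items())
    var = sum((x - mean) ** 2 * p for x, p in pdf.items())
    return pdf, mean, sqrt(var)


def normalise(pdf, mean, std):
    """Rescale to zero mean and unit standard deviation (assumed convention).

    Each score x becomes (x - mean) / std and each probability is
    multiplied by std, so the rescaled curve still encloses unit area.
    """
    return [((x - mean) / std, p * std) for x, p in sorted(pdf.items())]


# Hypothetical sample of scores, for illustration only.
sample = [0, 1, 1, 2, 0, 3, 1, 2, 1, 0, 4, 2]
pdf, mu, sigma = empirical_pdf(sample)
scaled = normalise(pdf, mu, sigma)
```

Rescaling every dataset in this way puts the home, away and total scores on a common axis, which is what allows a one-parameter family of curves to be compared with all of them.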

The Poisson distribution is defined
by

$P(x \mid \mu) = \frac{\mu^{x}}{x!} e^{-\mu} I_{[0,1,\ldots)}(x),$ (2)

where $I_{[0,1,\ldots)}(x)$ ensures that $P(x) = 0$
for non-integer $x$. In the Poisson
PDF $\mu = \sigma^2$; for data to be well-
fitted by this distribution we require
$\mu \approx \sigma^2$. It is explained in [1,2] that
this condition does not hold for foot- 
ball goals because a constant proba- 
bility per unit time of scoring a goal 
is not a valid assumption. Instead, a 
compound Poisson or negative bino- 
mial distribution is used, defined by 

$P(x \mid r, p) = \binom{x + r - 1}{r - 1} p^{r} q^{x} I_{[0,1,\ldots)}(x),$ (3)

where $x$ is the number of goals scored
with probability $q$ per goal before $r$
"failed goals" (probability $p = 1 - q$)
have occurred. The negative binomial
PDF has $\mu = r(1 - p)/p$ and
$\sigma^2 = r(1 - p)/p^2$; fitting to data thus
requires $\mu/\sigma^2 = p < 1$ and
$\mu p/(1 - p) = r$,
where we round $r$ to the nearest in-
teger.
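
The moment conditions above can be turned into a short fitting routine. The sketch below is illustrative only: the function names are hypothetical, the negative binomial is written in the standard form whose mean and variance match those quoted in the text, and the example moments are the domestic total-score values given in the caption of Fig. 3.

```python
from math import comb, exp, factorial


def poisson_pmf(x, mu):
    """Poisson probability of x goals for mean mu (Eq. 2, integer x only)."""
    return mu ** x * exp(-mu) / factorial(x)


def fit_negative_binomial(mu, var):
    """Method of moments: p = mu / var and r = mu * p / (1 - p), r rounded."""
    if var <= mu:
        raise ValueError("a negative binomial fit needs variance > mean")
    p = mu / var
    r = max(1, round(mu * p / (1 - p)))  # keep at least one "failed goal"
    return r, p


def negative_binomial_pmf(x, r, p):
    """Probability of x goals before r failures, each failure having probability p."""
    return comb(x + r - 1, r - 1) * p ** r * (1 - p) ** x


# Domestic total scores: mean 2.9 and standard deviation 1.9 (Fig. 3 caption).
mu, sigma = 2.9, 1.9
r, p = fit_negative_binomial(mu, sigma ** 2)
```

Because r is rounded to an integer, the fitted curve reproduces the sample mean and variance only approximately.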



3 Extremal statistics 

Our data analysis presented in Sect. 4 
(Figs. 1 to 3) shows that the tails of 
the PDFs of goal scores in the do- 
mestic matches clearly deviate from 
both the Poisson and the negative bi- 
nomial distributions. Here, we com- 
pare the PDFs of the data with those 
arising from extremal statistics. We 
choose extremal distributions fitted 
over the entire dataset in preference 
to a piecewise fit of arbitrary func- 
tions as (1) they have been observed 
in a wide variety of natural systems; 
(2) they may suggest a physical inter- 
pretation, as they arise in situations 
where only the largest events are ob- 
served; (3) following normalisation of 
the data, only one parameter (a) re- 
mains to be estimated, and (4) un- 
like log-normal PDFs, they can be 
applied to data including zero values. 

The two limiting distributions of
interest are "Gumbel's asymptote"
and Frechet [11-13]. In outline, the
limiting distributions result from
selecting the maximum value $x_{\max}$
from each of a large number of large
samples whose individual members
are drawn from a distribution $P(x)$.
When $P(x)$ decreases more rapidly
than any power-law (as $x \to \infty$),
"Gumbel's asymptote" has the form

$P_G(x_{\max}) = K \left(e^{u - e^{u}}\right)^{a}$ (4)
with $u = b(x - s)$

where in the limit of an infinite num-
ber of measurements $a = 1$; the con-
stants $K$, $b$, and $s$ are fixed by nor-
malisation as in Sect. 2 (see [14,15]).
Selecting the second largest values
from the same large samples pro-
duces a PDF of the same functional
form as Eq. 4 but with $a = 2$.
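
One way to see how normalisation fixes the constants is sketched below for the Gumbel form of Eq. 4. Several things here are assumptions of the sketch and not statements from the text: that normalisation means unit area, zero mean and unit variance; that b is taken negative so the slowly decaying tail points towards high scores; and that the closed forms used for K, b and s follow from substituting t = a exp(u) in the moment integrals. The digamma and trigamma functions are approximated by finite differences of `math.lgamma`.

```python
from math import exp, lgamma, log, sqrt


def _digamma(a, h=1e-4):
    """First derivative of log-gamma, by central difference."""
    return (lgamma(a + h) - lgamma(a - h)) / (2 * h)


def _trigamma(a, h=1e-3):
    """Second derivative of log-gamma, by central second difference."""
    return (lgamma(a + h) - 2 * lgamma(a) + lgamma(a - h)) / h ** 2


def gumbel_constants(a):
    """K, b, s giving unit area, zero mean and unit variance (assumed convention)."""
    b = -sqrt(_trigamma(a))          # b < 0 is an assumption of this sketch
    s = -(_digamma(a) - log(a)) / b  # shifts the mean to zero
    K = abs(b) * exp(a * log(a) - lgamma(a))  # |b| a^a / Gamma(a)
    return K, b, s


def gumbel_pdf(x, a=1.0):
    """Evaluate K * (exp(u - exp(u)))**a with u = b * (x - s)."""
    K, b, s = gumbel_constants(a)
    u = b * (x - s)
    if u > 700:                      # exp(u) would overflow; the PDF is zero here
        return 0.0
    return K * exp(a * (u - exp(u)))
```

With the constants fixed in this way, a is the only quantity left to vary when the curve is compared with rescaled data.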

Frechet distributions $P_F(x_{\max})$ arise
in the same manner when the un-
derlying PDF $P(x)$ is power-law; the
power-law tail of this underlying dis-
tribution is preserved in the Frechet,
thus lending itself to the fitting of
heavy-tailed data. Mathematically,
$P_F(x_{\max})$ can be defined by Eq. 4
but with $u = \alpha + \beta \ln(1 + x/G)$,
where $K$, $\alpha$, and $G$ are again fixed
by normalisation, and $\beta = (1 - a)^{-1}$
[15]. These curves exist for $a > 1$.
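
To make the preserved power-law tail explicit, the short derivation below takes the Gumbel functional form with the Frechet substitution for u at face value, reading the relation between the two exponents as beta = 1/(1 - a). It is a sketch under those assumptions, not a result quoted from [15].

```latex
% For a > 1, \beta = (1 - a)^{-1} < 0, so u \to -\infty and e^{u} \to 0 as x \to \infty.
P_F(x) = K \exp\!\left(a u - a e^{u}\right), \qquad u = \alpha + \beta \ln(1 + x/G)
\;\Longrightarrow\;
P_F(x) \simeq K e^{a\alpha} \left(1 + x/G\right)^{a\beta}
       = K e^{a\alpha} \left(1 + x/G\right)^{-a/(a-1)} \qquad (x \to \infty).
```

Under this reading the tail exponent a/(a - 1) grows rapidly as a approaches 1; for example, a = 1.10 gives an exponent of 11 and a = 1.04 gives 26.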

A simple heavy-tailed distribution 
often encountered in nature is the 
log-normal. Log-normal distribu- 
tions with the same means and vari- 
ances as the datasets provide very 
poor models in all cases if scores 
of zero are included. The domestic 
home and away scores are quite well 
modelled by log-normal PDFs pro- 
viding zero scores are neglected. Al- 
ternatively, one goal can be added to 
all scores but, since the log-normal
is not invariant under translation, 
the results are no more meaningful. 
Scores of zero occur frequently and 
should not be removed; we seek a 
single heavy-tailed PDF appropriate 
for modelling integer data from zero
upwards. 



4 Results 

As discussed in Sect. 1, the Poisson 
distribution has been demonstrated 
to be inferior to the negative bino-
mial when modelling football scores;
only where this is not the case do
we include a Poisson fit in Figs. 1-
3. In Fig. 1 we show the PDFs of
home team scores with their respec-
tive negative binomial PDFs (fitted
to $\mu$ and $\sigma$) along with the best-fit
extremal distribution (see Sect. 3)
for the domestic matches. While the
league scores follow a negative bino-
mial PDF, it is clear that the domes-
tic scores are better described by a
Frechet distribution beyond about
$\mu + 3\sigma$ (a home score of about 6
goals). Although the Cup scores are 
suggestive of some departure from a 
negative binomial PDF, we cannot 
quantify the functional form of this 
tail. Counting errors caused by bin- 
ning a finite dataset are omitted from 
Figs. 1-3 for the purpose of clarity; 
typical sizes of counting errors are 
indicated by fluctuations around a
smooth trend and become apparent 
in the final few bins. 

We plot the away team scores in 
Fig. 2. Again, the domestic scores 
are consistent with a Frechet distri-
bution above $\mu + 4\sigma$ (an away score
of about 6 goals) whereas negative 
binomial PDFs suffice for the league 
and Cup scores if the last few points 
are discounted as explained above. 
The total goal scores with fitted neg- 



ative binomial PDFs are plotted in 
Fig. 3. Here we find that the domes-
tic scores are consistent with a Gum-
bel distribution (see Sect. 3) above
$\mu + 3\sigma$ (9 goals), and that the league
scores are more suggestive of a Pois- 
son than a negative binomial PDF. 

We now provide more detailed anal- 
ysis of the goodness of fit of the ex- 
tremal PDFs to the domestic scores. 
Figures 4-6 show in linear form the 
closeness of fit of various distribu- 
tions to the domestic data. To quan- 
tify whether the data are consistent 
with the fitted PDFs, one must es- 
timate the likely counting errors in 
the numbers of points in the bins 
(introduced by the finite size of the 
dataset). We are interested in the 
distribution of the number of points 
in a bin, given both the total number 
of data points and the probability 
that any point will lie in that partic- 
ular bin (given directly by the fitted 
PDF). A binomial distribution of 
counting errors is a reasonable esti- 
mate, and one can thus estimate the 
upper and lower limits of the number 
of points one would expect to find in 
any bin on 95% of occasions; these 
are plotted with dashed lines. The 
ending of a dashed line indicates that 
the lower/upper limit of the expected 
number of occurrences of the corre-
sponding score is zero; where both
limits stop, no higher scores are ex- 
pected given the size of the dataset. 
From these plots we again conclude 
that a Frechet a=1.04 PDF is the 
best fit to domestic home scores, a 
Frechet a=1.10 PDF is the best fit to 
domestic away scores, and a Gumbel 
a=1 PDF is the best fit to domestic
total scores. 
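
The counting-error limits described above can be sketched as follows. This is an illustrative implementation, not the authors' code: the function names are hypothetical, and the interval is taken to be the central 95% range of a binomial distribution, which is one reasonable reading of the limits expected "on 95% of occasions".

```python
from math import exp, lgamma, log


def binomial_pmf(k, n, p):
    """Probability of exactly k of n points landing in a bin of probability p."""
    if p <= 0:
        return 1.0 if k == 0 else 0.0
    if p >= 1:
        return 1.0 if k == n else 0.0
    log_pmf = (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
               + k * log(p) + (n - k) * log(1 - p))
    return exp(log_pmf)


def count_limits(n, p, coverage=0.95):
    """Lower and upper counts bounding the central `coverage` of Binomial(n, p)."""
    tail = (1 - coverage) / 2
    cdf, lo, hi = 0.0, None, n
    for k in range(n + 1):
        cdf += binomial_pmf(k, n, p)
        if lo is None and cdf >= tail:
            lo = k                   # smallest count with CDF >= lower tail
        if cdf >= 1 - tail:
            hi = k                   # smallest count with CDF >= upper tail
            break
    return lo, hi


# Hypothetical bin: 135,000 matches and a fitted probability of 1e-5 for this score.
lo, hi = count_limits(135000, 1e-5)
```

For a bin this rare the lower limit is zero, which corresponds to the point at which a dashed line ends in the plots.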

Fig. 1. PDFs of goals scored by home teams, normalised with respect to $\mu$ and $\sigma$,
showing how domestic matches are more closely fitted by a Frechet distribution than 
by a negative binomial. Coincident curves are plotted as a single line as indicated 
in the legend. 



The empirical PDF varies between 
the negative binomial and extremal 
PDF in each case; for low scores 
both the negative binomial and ex- 
tremal distributions provide satisfac- 
tory fits. However, as shown in the 
previous figures, there is a strong de- 
parture from the negative binomial 
on to heavier-tailed distributions for 
the higher scores. Our aim here is 
to identify a single distribution that 
fits the whole dataset rather than 
an arbitrary piecewise fit. The lat- 
ter could always be achieved given a 
sufficient number of independent dis- 
tribution functions to fit to different 



ranges of data, but would ultimately 
be less informative of the underlying 
processes. 

In this context it is important to 
note that the distribution of the ag- 
gregate of many thin-tailed datasets 
(i.e. the pooled data) is heavy-tailed 
if the variances of the component 
datasets differ [27,28]. Hence the 
heavy-tails seen in worldwide foot- 
ball results could arise simply from 
the aggregation of scores from many 
teams. Individual teams' scores may 
follow different Poisson distribu- 
tions, which when pooled produce 
countries' scores following negative 







Fig. 2. PDFs of goals scored by away teams, normalised with respect to $\mu$ and $\sigma$,
showing how domestic matches are more closely fitted by a Frechet distribution than
by a negative binomial. Coincident curves are plotted as a single line as indicated 
in the legend. 



binomials, and then the aggregation 
of countries' scores is heavy-tailed. 
Testing this hypothesis would require 
significantly more data than used 
here, and would run over an interval 
of time that may imply significant 
changes in the game process. The 
alternative - and more interesting 
- possibility is that the heavy tails 
are the result of some inherent pro- 
cess that increases the likelihood of 
high scores over their Poisson-based 
expectations. 
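
The aggregation effect can be illustrated with a small simulation. Everything specific in the sketch below is an assumption made for illustration: the gamma distribution of team means, its parameters, the sample size, and the use of the variance-to-mean ratio as a simple measure of departure from a single Poisson process.

```python
import random
from math import exp


def poisson_sample(rng, mu):
    """Draw one Poisson variate with mean mu (Knuth's multiplication method)."""
    limit, k, prod = exp(-mu), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def dispersion(scores):
    """Variance-to-mean ratio; close to 1 for a single Poisson process."""
    n = len(scores)
    mean = sum(scores) / n
    var = sum((x - mean) ** 2 for x in scores) / n
    return var / mean


rng = random.Random(1)

# One scoring rate shared by every team.
single = [poisson_sample(rng, 1.6) for _ in range(20000)]

# Hypothetical teams whose mean scores differ (gamma-distributed, mean 1.6).
team_means = [rng.gammavariate(4.0, 0.4) for _ in range(20000)]
pooled = [poisson_sample(rng, m) for m in team_means]

ratio_single = dispersion(single)
ratio_pooled = dispersion(pooled)
```

With these parameters the pooled scores have a variance well above their mean, while the single-rate scores do not, so pooling alone is enough to produce a broader distribution than any one Poisson process.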

We also find that both the English 
data and the worldwide domestic 
results show a mean goal difference 
(home score minus away score) of 



0.51, an aggregate home advantage 
(see [29]), and a bias towards uneven 
scores as the total score rises is evi- 
dent in the larger domestic dataset; 
these trends are well-known in the 
world of football. 

It is important to note that the ob- 
servation of a departure from nega- 
tive binomial distributions is not the 
result of a larger dataset for domes- 
tic matches. Whilst more rare events 
are observed in a larger sample and 
the distribution extends to higher 
values with lower probabilities, it is 
nevertheless possible to distinguish 
between the different distributions, 
as we have shown, without consid- 

Fig. 3. PDFs of total goals scored, normalised with respect to $\mu$ and $\sigma$, showing
how domestic matches are more closely fitted by a Gumbel distribution than by a
negative binomial. Coincident curves are plotted as a single line as indicated in the
legend. Domestic $\mu = 2.9$, $\sigma = 1.9$; league $\mu = 2.6$, $\sigma = 1.7$; Cup $\mu = 2.8$, $\sigma = 1.8$.



ering these extreme values. We have 
looked briefly at other individual 
countries and find similar trends to 
those shown for English matches. 



5 Conclusions 

We have shown that the simplest 
models - the thin-tailed Poisson 
and negative binomial distributions 
based on the assumption of uncor- 
related processes - do not fit do- 
mestic (worldwide) football matches 
between 1999 and 2001 beyond the 
low scores. Heavier-tailed distribu- 



tions are required if these datasets 
are to be fitted with single PDFs. 
Log-normal distributions do not in- 
clude zero scores whereas extremal 
distributions can model the entire 
range of scores. Extremal distribu- 
tions have been observed in a variety 
of complex systems and our results 
may then inform the modelling of 
football games. 

In addition, using English top divi- 
sion and FA Cup matches in the sea- 
sons of 1970/71 to 2000/01, we con- 
firm the Poisson or negative binomial 
nature of English scores as reported 
in analyses of earlier football seasons. 

Fig. 4. Normalised PDF of domestic home scores plotted against a range of fitted
PDFs. Straight lines indicate where the points would lie were the fits perfect, and are
separated by an arbitrary vertical displacement; dashed lines indicate 95% binomial
counting errors. Note how the Frechet a=1.04 PDF (e) provides a superior fit to the
Gumbel a=1,2 (c,d), negative binomial (b), and Poisson (a) distributions; compare
Fig. 1. 